267 research outputs found

Exact Hypothesis Tests for Log-linear Models with exactLoglinTest

Author: Brian Caffo

This manuscript overviews exact testing of goodness of fit for log-linear models using the R package exactLoglinTest. This package evaluates model fit for Poisson log-linear models by conditioning on minimal sufficient statistics to remove nuisance parameters. A Monte Carlo algorithm is proposed to estimate P values from the resulting conditional distribution. In particular, this package implements a sequentially rounded normal approximation and importance sampling to approximate probabilities from the conditional distribution. Usually, this results in a high percentage of valid samples. However, in instances where this is not the case, a Metropolis-Hastings algorithm can be implemented that makes more localized jumps within the reference set. The manuscript details how some conditional tests for binomial logit models can also be viewed as conditional Poisson log-linear models and hence can be performed via exactLoglinTest. A diverse battery of examples is considered to highlight the use, features, and extensions of the software. Notably, potential extensions to evaluating disclosure risk are also considered.
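The idea of conditioning on sufficient statistics can be illustrated with a minimal Python sketch for the simplest case, the independence model in a two-way table. This is not the algorithm of exactLoglinTest (which the abstract describes as a rounded normal approximation with importance sampling); it swaps in plain permutation sampling, which draws tables with the observed row and column totals directly. The function names `pearson_stat` and `conditional_mc_pvalue` are hypothetical, and the sketch assumes every row and column total is positive.

```python
import numpy as np

def pearson_stat(table):
    """Pearson chi-square statistic for independence in a two-way table."""
    table = np.asarray(table, dtype=float)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return ((table - expected) ** 2 / expected).sum()

def conditional_mc_pvalue(table, n_sim=2000, seed=0):
    """Monte Carlo p-value conditional on the row and column totals.

    Permuting the column labels of the individual observations keeps both
    margins fixed, so each simulated table comes from the conditional
    (multiple hypergeometric) distribution under independence.
    """
    table = np.asarray(table, dtype=int)
    rng = np.random.default_rng(seed)
    n_rows, n_cols = table.shape
    # One (row label, column label) pair per observation.
    rows = np.repeat(np.arange(n_rows), table.sum(axis=1))
    cols = np.concatenate(
        [np.repeat(np.arange(n_cols), table[i]) for i in range(n_rows)]
    )
    observed = pearson_stat(table)
    hits = 0
    for _ in range(n_sim):
        sim = np.zeros_like(table)
        np.add.at(sim, (rows, rng.permutation(cols)), 1)
        # Small tolerance guards against floating-point ties.
        hits += int(pearson_stat(sim) >= observed - 1e-12)
    # Count the observed table itself so the estimate is never zero.
    return (hits + 1) / (n_sim + 1)
```

Calling `conditional_mc_pvalue([[10, 0], [0, 10]])` should give a very small value, since almost no table with those margins is as extreme as the observed one. For models with more complicated sufficient statistics, simple permutation no longer works, which is the setting where the sampling schemes named in the abstract become relevant.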

Research Papers in Economics

Fast, Exact Bootstrap Principal Component Analysis for p>1 million

Authors: Brian Caffo, Aaron Fisher, Brian Schwartz, Vadim Zipunnikov
Publication date: 14/05/2014

Many have suggested a bootstrap procedure for estimating the sampling variability of principal component analysis (PCA) results. However, when the number of measurements per subject (p) is much larger than the number of subjects (n), the challenge of calculating and storing the leading principal components from each bootstrap sample can be computationally infeasible. To address this, we outline methods for fast, exact calculation of bootstrap principal components, eigenvalues, and scores. Our methods leverage the fact that all bootstrap samples occupy the same n-dimensional subspace as the original sample. As a result, all bootstrap principal components are limited to the same n-dimensional subspace and can be efficiently represented by their low dimensional coordinates in that subspace. Several uncertainty metrics can be computed solely based on the bootstrap distribution of these low dimensional coordinates, without calculating or storing the p-dimensional bootstrap components. Fast bootstrap PCA is applied to a dataset of sleep electroencephalogram (EEG) recordings (p=900, n=392), and to a dataset of brain magnetic resonance images (MRIs) (p ≈ 3 million, n=352). For the brain MRI dataset, our method allows for standard errors for the first 3 principal components based on 1000 bootstrap samples to be calculated on a standard laptop in 47 minutes, as opposed to approximately 4 days with standard methods.

Comment: 25 pages, including 9 figures and link to R package. 2014-05-14 update: final formatting edits for journal submission, condensed figure
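The subspace argument in this abstract can be sketched in a few lines of NumPy. The code below is an illustrative reading of the abstract, not the authors' R package: the function names `subspace_basis` and `low_dim_bootstrap_pcs` are hypothetical, and the data matrix is assumed to be stored with measurements in rows (p) and subjects in columns (n).

```python
import numpy as np

def subspace_basis(Y):
    """SVD of the centered p x n data matrix.

    Returns V (p x n, orthonormal columns spanning the sample subspace)
    and DUt (n x n), the coordinates of each subject in that basis.
    """
    Yc = Y - Y.mean(axis=1, keepdims=True)
    V, d, Ut = np.linalg.svd(Yc, full_matrices=False)
    return V, d[:, None] * Ut

def low_dim_bootstrap_pcs(DUt, idx, n_components):
    """Principal components of one bootstrap sample, in n-dim coordinates.

    idx holds the resampled subject indices. Only an n x n SVD is needed;
    the p-dimensional component k is V @ A[:, k] if it is ever required.
    """
    S = DUt[:, idx]
    S = S - S.mean(axis=1, keepdims=True)  # re-center the bootstrap sample
    A, _, _ = np.linalg.svd(S, full_matrices=False)
    return A[:, :n_components]
```

A bootstrap loop would draw `idx = rng.integers(0, n, size=n)` repeatedly and keep only the small coordinate matrices. Because `V` has orthonormal columns, `V @ A` gives the left singular vectors of the resampled p x n matrix (up to sign), so summaries of the coordinates carry over to the full-dimensional components.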

arXiv.org e-Print Archive

CiteSeerX

Sparse Median Graphs Estimation in a High Dimensional Semiparametric Model

Authors: Brian Caffo, Fang Han, Han Liu
Publication date: 11/10/2013

In this manuscript a unified framework for conducting inference on complex aggregated data in high dimensional settings is proposed. The data are assumed to be a collection of multiple non-Gaussian realizations with underlying undirected graphical structures. Utilizing the concept of median graphs in summarizing the commonality across these graphical structures, a novel semiparametric approach to modeling such complex aggregated data is provided along with robust estimation of the median graph, which is assumed to be sparse. The estimator is proved to be consistent in graph recovery and an upper bound on the rate of convergence is given. Experiments on both synthetic and real datasets are conducted to illustrate the empirical usefulness of the proposed models and methods.

arXiv.org e-Print Archive

Collection Of Biostatistics Research Archive

Fixed-width output analysis for Markov chain Monte Carlo

Authors: Brian Caffo, Murali Haran, Galin Jones, Ronald Neath
Publication date: 01/01/2005

Markov chain Monte Carlo is a method of producing a correlated sample in order to estimate features of a target distribution via ergodic averages. A fundamental question is when should sampling stop? That is, when are the ergodic averages good estimates of the desired quantities? We consider a method that stops the simulation when the width of a confidence interval based on an ergodic average is less than a user-specified value. Hence calculating a Monte Carlo standard error is a critical step in assessing the simulation output. We consider the regenerative simulation and batch means methods of estimating the variance of the asymptotic normal distribution. We give sufficient conditions for the strong consistency of both methods and investigate their finite sample properties in a variety of examples.
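A fixed-width stopping rule based on batch means might look like the following Python sketch. It is a simplified illustration under stated assumptions, not the procedure analyzed in the paper: the batch size is taken as the square root of the chain length, the critical value is a fixed normal quantile `z`, and the names `batch_means_se`, `sample_until_fixed_width`, `step`, `n_min`, `check_every`, and `n_max` are all hypothetical.

```python
import numpy as np

def batch_means_se(draws):
    """Monte Carlo standard error of the mean via non-overlapping batch means."""
    draws = np.asarray(draws, dtype=float)
    batch_size = int(np.sqrt(len(draws)))      # assumed batch-size choice
    n_batches = len(draws) // batch_size
    used = draws[: n_batches * batch_size]     # drop the incomplete last batch
    batch_avgs = used.reshape(n_batches, batch_size).mean(axis=1)
    # batch_size * var(batch averages) estimates the asymptotic variance.
    asym_var = batch_size * batch_avgs.var(ddof=1)
    return np.sqrt(asym_var / len(used))

def sample_until_fixed_width(step, state, eps, z=1.96,
                             n_min=1000, check_every=500, n_max=200_000):
    """Run the chain until the interval half-width drops below eps.

    step maps the current state to the next one. Returns the ergodic
    average, the final half-width, and the number of draws used.
    """
    draws = []
    while len(draws) < n_max:
        state = step(state)
        draws.append(state)
        n = len(draws)
        if n >= n_min and n % check_every == 0:
            if z * batch_means_se(draws) < eps:
                break
    return float(np.mean(draws)), float(z * batch_means_se(draws)), len(draws)
```

For a quick trial, `step` can be a stand-in such as an AR(1) update, `lambda x: 0.5 * x + rng.normal()`, rather than a real sampler. The minimum run length `n_min` guards against stopping on an unreliable early variance estimate; if `n_max` is reached first, the returned half-width will still exceed `eps`.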

arXiv.org e-Print Archive

CiteSeerX

Collection Of Biostatistics Research Archive

Joint Estimation of Multiple Graphical Models from High Dimensional Time Series

Authors: Brian Caffo, Fang Han, Han Liu, Huitong Qiu
Publication date: 01/11/2013

In this manuscript we consider the problem of jointly estimating multiple graphical models in high dimensions. We assume that the data are collected from n subjects, each of which consists of T possibly dependent observations. The graphical models of subjects vary, but are assumed to change smoothly corresponding to a measure of closeness between subjects. We propose a kernel based method for jointly estimating all graphical models. Theoretically, under a double asymptotic framework, where both (T,n) and the dimension d can increase, we provide the explicit rate of convergence in parameter estimation. It characterizes the strength one can borrow across different individuals and the impact of data dependence on parameter estimation. Empirically, experiments on both synthetic and real resting state functional magnetic resonance imaging (rs-fMRI) data illustrate the effectiveness of the proposed method.

Comment: 40 pages

arXiv.org e-Print Archive

Collection Of Biostatistics Research Archive

A User-Friendly Introduction to Link-Probit-Normal Models

Authors: Brian S. Caffo, Michael Griswold
Publication venue: Collection of Biostatistics Research Archive
Publication date: 18/08/2005

Probit-normal models have attractive properties compared to logit-normal models. In particular, they allow for easy specification of marginal links of interest while permitting a conditional random effects structure. Moreover, programming fitting algorithms for probit-normal models can be trivial with the use of well-developed algorithms for approximating multivariate normal quantiles. In typical settings, the data cannot distinguish between probit and logit conditional link functions. Therefore, if marginal interpretations are desired, the default conditional link should be the most convenient one. We refer to models with a probit conditional link, an arbitrary marginal link, and a normal random effect distribution as link-probit-normal models. In this manuscript we outline these models and discuss appropriate situations for using multivariate normal approximations. Unlike other manuscripts in this area that focus on very general situations and implement Markov chain or MCEM algorithms, we focus on simpler, random intercept settings and give a collection of user-friendly examples and reproducible code. Marginally, the link-probit-normal model is obtained by a non-linear model on a discretized multivariate normal distribution, and thus can be thought of as a special case of discretizing a multivariate T distribution (as the degrees of freedom go to infinity). We also consider the larger class of multivariate T marginal models and illustrate how these models can be used to closely approximate a logit link.
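One reason the probit conditional link is convenient is a standard identity: with a normal random intercept b ~ N(0, sigma^2), the marginal success probability E[Phi(eta + b)] equals Phi(eta / sqrt(1 + sigma^2)). The Python sketch below checks this identity numerically for the simplest random-intercept case. It is an illustration only and is not taken from the manuscript's code; the function names and the trapezoid-rule settings (`n_grid`, `width`) are hypothetical choices, and sigma is assumed to be strictly positive.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def marginal_prob_closed_form(eta, sigma):
    """Marginal P(Y = 1) when P(Y = 1 | b) = Phi(eta + b), b ~ N(0, sigma^2)."""
    return STD_NORMAL.cdf(eta / sqrt(1.0 + sigma ** 2))

def marginal_prob_numeric(eta, sigma, n_grid=4001, width=10.0):
    """Same quantity by trapezoid-rule integration over the random intercept."""
    lo, hi = -width * sigma, width * sigma
    h = (hi - lo) / (n_grid - 1)
    intercept_dist = NormalDist(0.0, sigma)
    total = 0.0
    for i in range(n_grid):
        b = lo + i * h
        weight = 0.5 if i in (0, n_grid - 1) else 1.0  # trapezoid end weights
        total += weight * STD_NORMAL.cdf(eta + b) * intercept_dist.pdf(b)
    return total * h
```

The two functions should agree to several decimal places for moderate values of `eta` and `sigma`. No comparable closed form exists for a logit conditional link, which is the practical point behind preferring the probit link when marginal interpretations are wanted.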

CiteSeerX

Collection Of Biostatistics Research Archive

A Novel and Simple Rule of Thumb for Multiplicity Control in Equivalence Testing Using Two One-Sided Tests

Authors: Brian S. Caffo, Carolyn Lauzon
Publication venue: Collection of Biostatistics Research Archive
Publication date: 23/07/2008

Equivalence testing is growing in use in scientific research outside of its traditional role in the drug approval process. Largely due to its ease of use and recommendation from the United States Food and Drug Administration guidance, the most common statistical method for testing (bio)equivalence is the two one-sided tests procedure (TOST). Like classical point-null hypothesis testing, TOST is subject to multiplicity concerns as more comparisons are made. In this manuscript, a condition that bounds the family-wise error rate (FWER) using TOST is given. This condition then leads to a simple solution for controlling the FWER. Specifically, we demonstrate that if all pairwise comparisons of k independent groups are being evaluated for equivalence, then simply scaling the nominal Type I error rate down by (k - 1) is sufficient to maintain the family-wise error rate at the desired value or less. The resulting rule is much less conservative than the equally simple Bonferroni correction. An example of equivalence testing in a non drug-development setting is given.
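The rule stated in this abstract can be sketched as follows in Python. This is a rough illustration, not the authors' code: it assumes a large-sample normal approximation in place of the t distribution usually used with TOST, a symmetric equivalence margin, and hypothetical function names `tost_pvalue` and `pairwise_equivalence`.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist, mean, variance

def tost_pvalue(x, y, margin):
    """TOST p-value for H0: |mean(x) - mean(y)| >= margin (normal approximation)."""
    diff = mean(x) - mean(y)
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    std_normal = NormalDist()
    # One-sided test against the lower bound: H0 is diff <= -margin.
    p_lower = 1.0 - std_normal.cdf((diff + margin) / se)
    # One-sided test against the upper bound: H0 is diff >= +margin.
    p_upper = std_normal.cdf((diff - margin) / se)
    # Equivalence is declared only if both one-sided tests reject.
    return max(p_lower, p_upper)

def pairwise_equivalence(groups, margin, alpha=0.05):
    """All pairwise TOST comparisons of k groups at level alpha / (k - 1)."""
    k = len(groups)
    level = alpha / (k - 1)
    decisions = {
        (i, j): tost_pvalue(groups[i], groups[j], margin) <= level
        for i, j in combinations(range(k), 2)
    }
    return decisions, level
```

With k = 4 groups there are 6 pairwise comparisons, so a Bonferroni correction would test each at alpha / 6, while the rule in the abstract uses alpha / 3. The gap widens as k grows, since the number of comparisons is k(k - 1)/2 but the divisor here is only k - 1.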

Collection Of Biostatistics Research Archive